The unit ratio-extended Weibull family and the dropout rate in Brazilian undergraduate courses

We propose a new family of distributions, called the unit ratio-extended Weibull (UREW) family. It is derived from a ratio transformation of an extended Weibull random variable. The use of this transformation is a novelty of the work, since it has been less explored than the exponential transformation and has not yet been studied within the extended Weibull class. Moreover, we offer a valuable alternative for modeling double-bounded variables on the unit interval. Five UREW special models are studied in detail, namely the: i) unit ratio-Gompertz; ii) unit ratio-Burr XII; iii) unit ratio-Lomax; iv) unit ratio-Rayleigh; and v) unit ratio-Weibull distributions. We propose a quantile parametrization for the new family. The maximum likelihood estimators (MLEs) are presented. A Monte Carlo study is performed to evaluate the behavior of the MLEs of the unit ratio-Gompertz and unit ratio-Rayleigh distributions. The latter model has a closed-form and approximately unbiased MLE for small sample sizes. Further, the UREW submodels are fitted to the dropout rate in Brazilian undergraduate courses. We focus on the areas of civil engineering, economics, computer sciences, and control engineering. The applications show that the new family is suitable for modeling educational data and may provide effective alternatives to other usual unit models, such as the beta, Kumaraswamy, and unit gamma distributions. The UREW submodels can also outperform some recent contributions in the unit distribution literature. Thus, the UREW family can provide competitive alternatives when those models are unsuitable.


Introduction
The formulation of new generalized classes of probability distributions has received a great deal of attention in recent years, particularly for positive data [1]. To mention a few, we refer the reader to [2-4] for extensions of the Weibull distribution and to [5,6] for Nadarajah-Haghighi generalizations. Most of these works aim to furnish more flexible distributions in terms of density shapes and hazard rates. However, there is much to be done for random variables supported on the unit interval. We can cite the beta and Kumaraswamy [7] (KW) distributions as classical unit models in this respect. [...] rates tend to receive higher quality ratings [34]. In addition, this measure is seen as an indicator of institutional excellence and performance [35]. Therefore, our proposals have the advantage of providing consistently better fits than the classical beta and KW distributions when modeling the dropout rate in Brazilian undergraduate courses (see Section 6). As illustrated in the applications, they can also outperform some recent contributions in the distribution literature, such as the UBS, UW, and CUW distributions. All analyses in this paper are carried out in the R programming language. The computational codes and data sets used to obtain the plots, simulations, and application results are available in a GitHub repository (https://github.com/penaramirez/UREW).
The rest of the paper is organized as follows. Section 2 presents the theoretical background and defines the new family of unit distributions. Some UREW special cases are presented in Section 3. Section 4 focuses on inferential procedures based on the maximum likelihood method. We present results for all family members and derive expressions for the MLEs of some special models. Section 5 discusses the results of simulation studies that assess the performance of the point and asymptotic interval estimators. Section 6 illustrates the relevance of the proposed family for educational data, specifically the first-year dropout rate in some Brazilian undergraduate courses. The final remarks are presented in Section 7.

The unit ratio-extended Weibull family of distributions
This section presents the theoretical background and defines the proposed family from a ratio transformation in the EW class of distributions. Let X be a random variable in the EW class, denoted by X ∼ EW(α, ξ). The probability density function (pdf) of X is given by

$$g(x) = \alpha\, h(x; \xi) \exp[-\alpha H(x; \xi)], \tag{1}$$

where x > 0, α > 0, H(x; ξ) is a non-negative monotonically increasing function that depends on the parameter vector ξ, and h(x; ξ) is the derivative of H(x; ξ) with respect to x. Each formulation of H(x; ξ) yields a different EW special model. Thus, several well-known distributions can be obtained depending on the choice of this function. Table 1 presents twenty alternatives for H(x; ξ), their corresponding derivatives, and inverse functions. Further details on this family and some generalizations to examine non-negative data are given by [36-38].
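The EW cumulative distribution function (cdf) and quantile function (qf) are given by

$$G(x) = 1 - \exp[-\alpha H(x; \xi)]$$

and

$$Q_X(u) = H^{-1}\!\left(-\frac{\log(1-u)}{\alpha}; \xi\right), \quad 0 < u < 1,$$

respectively, where H⁻¹(·; ξ) is the inverse function of H(·; ξ). We define the UREW class of distributions by considering the ratio transformation Y = X/(1 + X), where X ∼ EW(α, ξ). Hereafter, we denote Y as a UREW random variable, which has cdf

$$F(y) = 1 - \exp\!\left[-\alpha H\!\left(\frac{y}{1-y}; \xi\right)\right], \tag{2}$$

where 0 < y < 1, α > 0, and ξ is the parameter vector associated with the H(·; ξ) function. Thus, the pdf and qf of the proposed family are

$$f(y) = \frac{\alpha}{(1-y)^2}\, h\!\left(\frac{y}{1-y}; \xi\right) \exp\!\left[-\alpha H\!\left(\frac{y}{1-y}; \xi\right)\right]$$

and

$$Q_Y(u) = \frac{H^{-1}\!\left(-\log(1-u)/\alpha; \xi\right)}{1 + H^{-1}\!\left(-\log(1-u)/\alpha; \xi\right)},$$

respectively. The proposition below refers to a quantile-based parametrization for the UREW family. Analogous frameworks can be found in other recently introduced unit models. See [14,39,40] for median-based parametrizations and [41] for a quantile-based example.

Proposition 1. Let Y be a UREW random variable. Then its cdf can be rewritten as

$$F(y) = 1 - (1-\tau)^{H\left(\frac{y}{1-y}; \xi\right) \big/ H\left(\frac{q(\tau)}{1-q(\tau)}; \xi\right)}, \tag{3}$$

where q(τ) ∈ (0, 1) is a location parameter that corresponds to the τth quantile of Y, ξ is the parameter vector associated with H(·; ξ), and τ is assumed to be known.

Proof. The result in Eq (3) holds by replacing

$$\alpha = -\frac{\log(1-\tau)}{H\!\left(\frac{q(\tau)}{1-q(\tau)}; \xi\right)}$$

in (2). Hence, the qf of Y can be rewritten as

$$Q_Y(u) = \frac{H^{-1}(A_u; \xi)}{1 + H^{-1}(A_u; \xi)}, \qquad A_u = \frac{\log(1-u)}{\log(1-\tau)}\, H\!\left(\frac{q(\tau)}{1-q(\tau)}; \xi\right).$$

Setting u = τ in the above equation, we obtain Q_Y(τ) = q(τ), which concludes the proof.

Under the quantile parametrization, the UREW pdf can be written as

$$f(y) = \frac{-\log(1-\tau)}{(1-y)^2\, H\!\left(\frac{q(\tau)}{1-q(\tau)}; \xi\right)}\, h\!\left(\frac{y}{1-y}; \xi\right) (1-\tau)^{H\left(\frac{y}{1-y}; \xi\right) \big/ H\left(\frac{q(\tau)}{1-q(\tau)}; \xi\right)}. \tag{4}$$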
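The construction can be sketched in code. The following Python snippet is a minimal illustration, not the authors' R implementation: it builds the cdf, pdf, and qf of a UREW model from user-supplied callables for H, h, and H⁻¹, with α = −log(1 − τ)/H(q/(1 − q); ξ) so that the τth quantile equals q. The function name `make_urew` and its arguments are hypothetical; the Burr XII generator H(x; β) = log(1 + x^β) is used only as an example.

```python
import math

def make_urew(H, h, H_inv, tau=0.5):
    """Return (cdf, pdf, qf) of a UREW model in the quantile parametrization.

    H, h, H_inv are callables (value, xi) -> float for the EW generator,
    its derivative in x, and its inverse. Names are illustrative only.
    """
    def alpha(q, xi):
        # alpha is chosen so that cdf(q) equals tau
        return -math.log(1.0 - tau) / H(q / (1.0 - q), xi)

    def cdf(y, q, xi):
        return 1.0 - math.exp(-alpha(q, xi) * H(y / (1.0 - y), xi))

    def pdf(y, q, xi):
        x = y / (1.0 - y)
        a = alpha(q, xi)
        # EW density at x times the Jacobian 1 / (1 - y)^2 of x = y / (1 - y)
        return a * h(x, xi) * math.exp(-a * H(x, xi)) / (1.0 - y) ** 2

    def qf(u, q, xi):
        # EW quantile, then mapped back to the unit interval
        x = H_inv(-math.log(1.0 - u) / alpha(q, xi), xi)
        return x / (1.0 + x)

    return cdf, pdf, qf

# Example generator: Burr XII, H(x; beta) = log(1 + x**beta)
cdf, pdf, qf = make_urew(
    H=lambda x, b: math.log1p(x ** b),
    h=lambda x, b: b * x ** (b - 1.0) / (1.0 + x ** b),
    H_inv=lambda z, b: math.expm1(z) ** (1.0 / b),
    tau=0.5,
)
```

With τ = 0.5, `cdf(q, q, xi)` returns 0.5 and `qf(0.5, q, xi)` returns q, which is the content of Proposition 1.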

Table 1. Alternatives for H(x; ξ), with their derivatives h(x; ξ) and inverse functions H⁻¹(x; ξ).
Some UREW special cases
Several well-established statistical models are special cases of the EW family. They can be considered baseline models in the UREW family by replacing their corresponding H(·; ξ) functions in the cdf (2). Here, we give further details on five of those models, namely the unit ratio-Gompertz (URG), unit ratio-Burr XII (URBX II), unit ratio-Lomax (URL), unit ratio-Weibull (URW), and unit ratio-Rayleigh (URR) distributions. These models are introduced using the quantile parametrization given in Proposition 1. The H(·; ξ) functions of these and several other members of the UREW family can be consulted in Table 1.

The unit ratio-Gompertz distribution
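The URG distribution is obtained by considering the Gompertz as a baseline model in the UREW family. Thus, by taking H(x; ξ) = exp{βx} − 1 in (2), the URG cdf can be written as

$$F(y) = 1 - (1-\tau)^{\left[\exp\left\{\frac{\beta y}{1-y}\right\} - 1\right] \big/ \left[\exp\left\{\frac{\beta \mu}{1-\mu}\right\} - 1\right]},$$

where y ∈ (0, 1), β > 0 is a shape parameter, and μ ∈ (0, 1) is the τth quantile parameter. The corresponding pdf, qf, and hazard rate function (hrf) are

$$f(y) = \frac{-\beta \log(1-\tau) \exp\left\{\frac{\beta y}{1-y}\right\}}{(1-y)^2 \left[\exp\left\{\frac{\beta \mu}{1-\mu}\right\} - 1\right]}\, (1-\tau)^{\left[\exp\left\{\frac{\beta y}{1-y}\right\} - 1\right] \big/ \left[\exp\left\{\frac{\beta \mu}{1-\mu}\right\} - 1\right]}, \tag{5}$$

$$Q(u) = \frac{\log B_u}{\beta + \log B_u}, \qquad B_u = 1 + \frac{\log(1-u)}{\log(1-\tau)} \left[\exp\left\{\frac{\beta \mu}{1-\mu}\right\} - 1\right],$$

and

$$r(y) = \frac{-\beta \log(1-\tau) \exp\left\{\frac{\beta y}{1-y}\right\}}{(1-y)^2 \left[\exp\left\{\frac{\beta \mu}{1-\mu}\right\} - 1\right]},$$

respectively.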
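As a hedged illustration, the URG functions can be coded directly from the Gompertz generator H(x) = exp(βx) − 1 with α = −log(1 − τ)/H(μ/(1 − μ)). The Python function names below are hypothetical, and the sketch is not the authors' R code.

```python
import math

def urg_alpha(mu, beta, tau=0.5):
    # alpha implied by the quantile constraint F(mu) = tau
    return -math.log(1.0 - tau) / math.expm1(beta * mu / (1.0 - mu))

def urg_cdf(y, mu, beta, tau=0.5):
    x = y / (1.0 - y)
    return 1.0 - math.exp(-urg_alpha(mu, beta, tau) * math.expm1(beta * x))

def urg_hrf(y, mu, beta, tau=0.5):
    # hazard: alpha * h(x) / (1 - y)^2 with h(x) = beta * exp(beta * x)
    x = y / (1.0 - y)
    return urg_alpha(mu, beta, tau) * beta * math.exp(beta * x) / (1.0 - y) ** 2

def urg_pdf(y, mu, beta, tau=0.5):
    # density = hazard * survival
    return urg_hrf(y, mu, beta, tau) * (1.0 - urg_cdf(y, mu, beta, tau))

def urg_qf(u, mu, beta, tau=0.5):
    # invert H(x) = exp(beta * x) - 1, then map x -> x / (1 + x)
    x = math.log1p(-math.log(1.0 - u) / urg_alpha(mu, beta, tau)) / beta
    return x / (1.0 + x)
```

For y very close to 1 the term exp(βy/(1 − y)) overflows in double precision, so a production implementation would need to work on the log scale there.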

The unit ratio-Burr XII distribution
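The URBX II distribution is obtained by considering the Burr XII as a baseline model in the UREW family. Thus, by taking H(x; ξ) = log[1 + x^β] in (2), and after simplification, the URBX II cdf reduces to

$$F(y) = 1 - \left[1 + \left(\frac{y}{1-y}\right)^{\beta}\right]^{\log(1-\tau) \big/ \log\left[1 + \left(\frac{\mu}{1-\mu}\right)^{\beta}\right]},$$

where y ∈ (0, 1), β > 0 is a shape parameter, and μ ∈ (0, 1) is the τth quantile parameter. The corresponding pdf, qf, and hrf are

$$f(y) = \frac{-\beta \log(1-\tau)}{\log\left[1 + \left(\frac{\mu}{1-\mu}\right)^{\beta}\right]}\, \frac{y^{\beta-1}}{(1-y)^{\beta+1}} \left[1 + \left(\frac{y}{1-y}\right)^{\beta}\right]^{\log(1-\tau) \big/ \log\left[1 + \left(\frac{\mu}{1-\mu}\right)^{\beta}\right] - 1}, \tag{7}$$

$$Q(u) = \frac{x_u}{1 + x_u}, \qquad x_u = \left\{\left[1 + \left(\frac{\mu}{1-\mu}\right)^{\beta}\right]^{\log(1-u)/\log(1-\tau)} - 1\right\}^{1/\beta},$$

and

$$r(y) = \frac{-\beta \log(1-\tau)}{\log\left[1 + \left(\frac{\mu}{1-\mu}\right)^{\beta}\right]}\, \frac{y^{\beta-1}}{(1-y)^{\beta+1} \left[1 + \left(\frac{y}{1-y}\right)^{\beta}\right]},$$

respectively.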
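The simplification step mentioned for the URBX II cdf can be made explicit. Writing x = y/(1 − y) and m = μ/(1 − μ) (shorthand introduced here for illustration), the survival function under the quantile parametrization of Proposition 1 with H(x) = log(1 + x^β) is:

```latex
\begin{align*}
1 - F(y)
  &= (1-\tau)^{H(x)/H(m)}
   = \exp\!\left\{ \log(1-\tau)\,
       \frac{\log(1 + x^{\beta})}{\log(1 + m^{\beta})} \right\} \\
  &= \left(1 + x^{\beta}\right)^{\log(1-\tau)/\log(1 + m^{\beta})} .
\end{align*}
```

At y = μ we have x = m, and the right-hand side equals 1 − τ, so μ is indeed the τth quantile.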

The unit ratio-Lomax distribution
The URL distribution is obtained by considering the Lomax as a baseline model in the UREW family. Thus, by taking H(x; ξ) = log[1 + x β ] in (2), and after simplification, the URL cdf reduces to

The unit ratio-Weibull and unit ratio-Rayleigh distributions
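The URW distribution is obtained by considering the Weibull as a baseline model in the UREW family. By taking H(x; ξ) = x^β in (2), the URW cdf can be written as

$$F(y) = 1 - (1-\tau)^{\left[\frac{y(1-\mu)}{\mu(1-y)}\right]^{\beta}},$$

where y ∈ (0, 1), β > 0 is a shape parameter, and μ ∈ (0, 1) is the τth quantile parameter. The corresponding pdf, qf, and hrf are

$$f(y) = \frac{-\beta \log(1-\tau)}{y(1-y)} \left[\frac{y(1-\mu)}{\mu(1-y)}\right]^{\beta} (1-\tau)^{\left[\frac{y(1-\mu)}{\mu(1-y)}\right]^{\beta}}, \tag{10}$$

$$Q(u) = \frac{\mu\, c_u}{1 - \mu + \mu\, c_u}, \qquad c_u = \left[\frac{\log(1-u)}{\log(1-\tau)}\right]^{1/\beta},$$

and

$$r(y) = \frac{-\beta \log(1-\tau)}{y(1-y)} \left[\frac{y(1-\mu)}{\mu(1-y)}\right]^{\beta},$$

respectively. For β = 2, the URW reduces to the URR distribution, which is also new. The URR is a one-parameter model obtained by considering the Rayleigh as a baseline model in the UREW family. Its pdf is given by

$$f(y) = -2 \log(1-\tau) \left(\frac{1-\mu}{\mu}\right)^{2} \frac{y}{(1-y)^{3}}\, (1-\tau)^{\left[\frac{y(1-\mu)}{\mu(1-y)}\right]^{2}}. \tag{11}$$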
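A short Python sketch of the URW and URR functions follows, assuming the Weibull generator H(x) = x^β and the quantile parametrization of Proposition 1. The function names are hypothetical and the code is illustrative, not the authors' implementation.

```python
import math

def _ratio(y, mu):
    # (y / (1 - y)) / (mu / (1 - mu)): odds of y relative to odds of mu
    return (y * (1.0 - mu)) / (mu * (1.0 - y))

def urw_cdf(y, mu, beta, tau=0.5):
    return 1.0 - (1.0 - tau) ** (_ratio(y, mu) ** beta)

def urw_pdf(y, mu, beta, tau=0.5):
    r_beta = _ratio(y, mu) ** beta
    return -math.log(1.0 - tau) * beta * r_beta * (1.0 - tau) ** r_beta / (y * (1.0 - y))

def urw_qf(u, mu, beta, tau=0.5):
    c = (math.log(1.0 - u) / math.log(1.0 - tau)) ** (1.0 / beta)
    return mu * c / (1.0 - mu + mu * c)

def urr_pdf(y, mu, tau=0.5):
    # the URR is the URW with beta fixed at 2
    return urw_pdf(y, mu, 2.0, tau)
```

Because c = 1 when u = τ, `urw_qf(tau, mu, beta, tau)` returns μ for any β, which makes the role of μ as the τth quantile explicit.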

Maximum likelihood estimation
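Here, we consider the estimation of the parameters of the UREW family by the maximum likelihood (ML) method. The log-likelihood for a random sample y_1, ..., y_n from (4), based on the parameter vector θ = (μ, ξ⊤)⊤, where μ denotes the τth quantile q(τ), is

$$\ell(\theta) = n \log[-\log(1-\tau)] - n \log H\!\left(\frac{\mu}{1-\mu}; \xi\right) - 2 \sum_{i=1}^{n} \log(1-y_i) + \sum_{i=1}^{n} \log h\!\left(\frac{y_i}{1-y_i}; \xi\right) + \frac{\log(1-\tau)}{H\!\left(\frac{\mu}{1-\mu}; \xi\right)} \sum_{i=1}^{n} H\!\left(\frac{y_i}{1-y_i}; \xi\right). \tag{12}$$

The components of the score vector are U_μ = ∂ℓ(θ)/∂μ and U_ξ = ∂ℓ(θ)/∂ξ. For fixed values of ξ, it is possible to obtain a closed form for the MLE of μ. By setting U_μ = 0 and solving for μ, we have

$$\hat{\mu} = \frac{H^{-1}(A_n; \xi)}{1 + H^{-1}(A_n; \xi)}, \qquad A_n = -\frac{\log(1-\tau)}{n} \sum_{i=1}^{n} H\!\left(\frac{y_i}{1-y_i}; \xi\right).$$

Therefore, obtaining the MLE of μ in closed form is possible when ξ = ∅, that is, for one-parameter special cases. Otherwise, to get the MLEs of the parameters μ and ξ, it is necessary to use an iterative procedure, such as a Newton-Raphson type algorithm, to maximize (12). We can construct approximate confidence intervals for θ based on the asymptotic normality property. Under standard regularity conditions, the asymptotic distribution of θ̂ − θ can be approximated by the multivariate normal N(0, J(θ̂)⁻¹) distribution, where J(θ̂) = −∂²ℓ(θ)/∂θ∂θ⊤ evaluated at θ = θ̂ is the observed information matrix. Thus, the asymptotic 100(1 − η)% confidence intervals for the components θ_i of θ are given by

$$\hat{\theta}_i \pm z_{\eta/2} \sqrt{\widehat{\mathrm{var}}(\hat{\theta}_i)},$$

where z_{η/2} is the η/2 quantile of the standard normal distribution, and var̂(θ̂_i) is the ith diagonal element of J(θ̂)⁻¹. In what follows, we present the likelihood estimation of some special cases of the UREW family.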
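The closed form for fixed ξ can be checked with a short derivation, sketched here under the quantile parametrization with α = −log(1 − τ)/H(μ/(1 − μ); ξ). The shorthand m = μ/(1 − μ), x_i = y_i/(1 − y_i), and S_H = Σ H(x_i; ξ) is introduced only for this illustration.

```latex
\begin{align*}
U_\mu = \frac{\partial \ell}{\partial \mu}
  &= -\frac{h(m;\xi)}{(1-\mu)^2\, H(m;\xi)}
     \left[\, n + \frac{\log(1-\tau)}{H(m;\xi)}\, S_H \right], \\
U_\mu = 0
  &\iff H(m;\xi) = -\frac{\log(1-\tau)}{n}\, S_H \\
  &\iff \hat{\mu} = \frac{\hat{m}}{1+\hat{m}}, \qquad
     \hat{m} = H^{-1}\!\left(-\frac{\log(1-\tau)}{n}\, S_H;\ \xi\right).
\end{align*}
```

Equivalently, the MLE of α for fixed ξ is n/S_H, and μ̂ follows from the invariance property of the MLE.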

MLE for the URR(μ) distribution
Let y_1, ..., y_n be a random sample of size n from the URR(μ) distribution. The log-likelihood function is

$$\ell(\mu) = n \log[-2\log(1-\tau)] + 2n \log\!\left(\frac{1-\mu}{\mu}\right) + \sum_{i=1}^{n} \log y_i - 3 \sum_{i=1}^{n} \log(1-y_i) + \log(1-\tau) \left(\frac{1-\mu}{\mu}\right)^{2} \sum_{i=1}^{n} \left(\frac{y_i}{1-y_i}\right)^{2}.$$

The score function U_μ is

$$U_\mu = -\frac{2n}{\mu(1-\mu)} - \frac{2 \log(1-\tau)(1-\mu)}{\mu^{3}} \sum_{i=1}^{n} \left(\frac{y_i}{1-y_i}\right)^{2}.$$

By setting U_μ = 0 and solving for μ, we have the MLE of μ as

$$\hat{\mu} = \frac{\sqrt{-\log(1-\tau) \sum_{i=1}^{n} \left(\frac{y_i}{1-y_i}\right)^{2}}}{\sqrt{n} + \sqrt{-\log(1-\tau) \sum_{i=1}^{n} \left(\frac{y_i}{1-y_i}\right)^{2}}}, \tag{14}$$

and the Fisher's observed information is computed as

$$I(\mu) = \frac{\partial^{2} \ell(\mu)}{\partial \mu^{2}} = \frac{2n(1-2\mu)}{\mu^{2}(1-\mu)^{2}} + \frac{2 \log(1-\tau)(3-2\mu)}{\mu^{4}} \sum_{i=1}^{n} \left(\frac{y_i}{1-y_i}\right)^{2}. \tag{15}$$

The conditions for the maximum value of the function ℓ(μ | y_1, ..., y_n) require that I(μ̂) < 0. This is easily observed by substituting (14) into (15), where it is verified that

$$I(\hat{\mu}) = -\frac{4n}{\hat{\mu}^{2}(1-\hat{\mu})^{2}} < 0.$$
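The closed-form estimator is easy to compute. The sketch below, in Python rather than the authors' R, returns the URR MLE together with an asymptotic standard error. The standard error μ̂(1 − μ̂)/(2√n) is derived here by evaluating the second derivative of the URR log-likelihood at μ̂, which gives −4n/[μ̂²(1 − μ̂)²]; the function names are hypothetical.

```python
import math

def urr_mle(sample, tau=0.5):
    """Closed-form MLE of the tau-th quantile mu of the URR model."""
    n = len(sample)
    s = sum((y / (1.0 - y)) ** 2 for y in sample)
    m_hat = math.sqrt(-math.log(1.0 - tau) * s / n)   # estimate of mu / (1 - mu)
    mu_hat = m_hat / (1.0 + m_hat)
    # asymptotic standard error from the observed information at mu_hat
    se = mu_hat * (1.0 - mu_hat) / (2.0 * math.sqrt(n))
    return mu_hat, se

def urr_loglik(mu, sample, tau=0.5):
    """URR log-likelihood, useful for checking that mu_hat is a maximizer."""
    c = -math.log(1.0 - tau)
    r = (1.0 - mu) / mu
    return sum(
        math.log(2.0 * c) + 2.0 * math.log(r) + math.log(y)
        - 3.0 * math.log(1.0 - y) - c * (r * y / (1.0 - y)) ** 2
        for y in sample
    )
```

An approximate 95% interval is then `mu_hat ± 1.96 * se`, and no starting values or iterative optimizer are required.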

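MLE for the URG(β, μ) distribution
Let y_1, ..., y_n be a random sample of size n from the URG distribution with parameter vector θ = (β, μ)⊤. The log-likelihood function is

$$\ell(\theta) = n \log[-\log(1-\tau)] + n \log \beta - n \log\!\left(e^{\beta\mu/(1-\mu)} - 1\right) + \beta \sum_{i=1}^{n} \frac{y_i}{1-y_i} - 2 \sum_{i=1}^{n} \log(1-y_i) + \frac{\log(1-\tau)}{e^{\beta\mu/(1-\mu)} - 1} \sum_{i=1}^{n} \left(e^{\beta y_i/(1-y_i)} - 1\right). \tag{16}$$

The components of the score vector U_θ = (U_β, U_μ)⊤ are

$$U_\beta = \frac{n}{\beta} - \frac{n \mu\, e^{\beta\mu/(1-\mu)}}{(1-\mu)\left(e^{\beta\mu/(1-\mu)} - 1\right)} + \sum_{i=1}^{n} \frac{y_i}{1-y_i} + \frac{\log(1-\tau)}{e^{\beta\mu/(1-\mu)} - 1} \left[\sum_{i=1}^{n} \frac{y_i\, e^{\beta y_i/(1-y_i)}}{1-y_i} - \frac{\mu\, e^{\beta\mu/(1-\mu)}}{(1-\mu)\left(e^{\beta\mu/(1-\mu)} - 1\right)} \sum_{i=1}^{n} \left(e^{\beta y_i/(1-y_i)} - 1\right)\right]$$

and

$$U_\mu = -\frac{\beta\, e^{\beta\mu/(1-\mu)}}{(1-\mu)^{2}\left(e^{\beta\mu/(1-\mu)} - 1\right)} \left[n + \frac{\log(1-\tau)}{e^{\beta\mu/(1-\mu)} - 1} \sum_{i=1}^{n} \left(e^{\beta y_i/(1-y_i)} - 1\right)\right].$$

Note that the system of equations U_θ = 0 cannot be solved in closed form; therefore, the maximization of (16) to obtain the MLE of θ can be carried out using the quasi-Newton BFGS nonlinear optimization algorithm implemented in the optim function available in R.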
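For illustration only, the sketch below fits the URG model in Python using a technique different from the BFGS routine used in the paper: it profiles out μ (for fixed β, the MLE of μ is available in closed form because the MLE of α is n/Σ H(x_i)) and then runs a golden-section search over log β. This assumes the profile log-likelihood is unimodal on the search interval; the function names, the search bounds, and the sampler are all hypothetical.

```python
import math
import random

def urg_sample(n, mu, beta, tau=0.5, seed=0):
    """Draw URG variates by inversion of the quantile function."""
    rng = random.Random(seed)
    alpha = -math.log(1.0 - tau) / math.expm1(beta * mu / (1.0 - mu))
    out = []
    for _ in range(n):
        x = math.log1p(-math.log(1.0 - rng.random()) / alpha) / beta
        out.append(x / (1.0 + x))
    return out

def urg_profile(beta, xs, tau=0.5):
    """Return (mu_hat(beta), profile log-likelihood) for x_i = y_i / (1 - y_i)."""
    n = len(xs)
    c = -math.log(1.0 - tau)
    mean_h = sum(math.expm1(beta * x) for x in xs) / n
    alpha = 1.0 / mean_h                      # MLE of alpha for fixed beta
    m = math.log1p(c * mean_h) / beta         # solves H(m) = c / alpha
    # at alpha_hat the term -alpha * sum(H(x_i)) equals -n
    loglik = n * math.log(alpha * beta) + beta * sum(xs) - n
    loglik += sum(2.0 * math.log1p(x) for x in xs)   # Jacobian: 1/(1-y)^2 = (1+x)^2
    return m / (1.0 + m), loglik

def urg_fit(ys, tau=0.5, lo=1e-3, hi=20.0, iters=80):
    """Golden-section search over log(beta) on the profile log-likelihood."""
    xs = [y / (1.0 - y) for y in ys]
    hi = min(hi, 700.0 / max(xs))             # keep exp(beta * x) within double range
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    f = lambda t: urg_profile(math.exp(t), xs, tau)[1]
    c1, c2 = b - g * (b - a), a + g * (b - a)
    f1, f2 = f(c1), f(c2)
    for _ in range(iters):
        if f1 < f2:                           # maximum lies in [c1, b]
            a, c1, f1 = c1, c2, f2
            c2 = a + g * (b - a)
            f2 = f(c2)
        else:                                 # maximum lies in [a, c2]
            b, c2, f2 = c2, c1, f1
            c1 = b - g * (b - a)
            f1 = f(c1)
    beta_hat = math.exp(0.5 * (a + b))
    return urg_profile(beta_hat, xs, tau)[0], beta_hat
```

A call such as `urg_fit(urg_sample(1000, mu=0.2, beta=2.0))` returns the pair (μ̂, β̂). A full implementation would also compute the observed information matrix for standard errors, as the paper does numerically through optim.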

Simulation study
In this section, a Monte Carlo study is carried out to evaluate the finite-sample performance of the MLEs of the UREW family. For that, the URR and URG distributions are considered. The study uses 10,000 Monte Carlo replications with sample sizes n ∈ {10, 25, 50, 75, 100}. To evaluate the point estimators, we use the estimates obtained in each replication to calculate their mean, variance, relative bias (RB), standard deviation (SD), and root mean squared error (RMSE). Regarding the initial values selected for simulation, we highlight that the URR distribution has a closed form for its MLE (see Eq (14)). Therefore, one advantage of using this model is that it does not require defining initial values in the ML method. For the two-parameter UREW special cases, we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and compute the observed information matrix numerically with the optim function in the R programming language. We set the sample quantile as the initial value for μ and one for the shape parameter. These values are used for both the simulated and the actual data experiments performed in the paper. To evaluate the interval estimation, we calculate the coverage probability of the 95% confidence interval (CP95%). Next, we present the numerical results for both distributions.

Numerical analysis for the URR distribution
We generate occurrences of the variable Y following a URR law with five different values of μ (scenarios). For that, we use the inversion method, replacing u ∼ U(0, 1) in the URR qf. The simulation results are shown in Table 2. It reveals low RB values in all the scenarios and sample sizes considered; all observed values are less than 0.7%. We also observe low SD values, all less than 0.5. For all sample sizes, the RMSE tends to be lower for central values of μ (μ = 0.4, for example) than for values close to the extremes (μ = 0.15 or μ = 0.9, for example). The last column of the table shows that the coverage probabilities of the 95% confidence intervals for the parameter are quite close to the nominal level.
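The structure of such an experiment can be sketched as follows. This Python snippet is a simplified stand-in for the authors' R simulation code: it draws URR samples by inversion, applies the closed-form MLE, and reports the mean, RB (in %), RMSE, and coverage of the 95% interval. The default of 2,000 replications, the seed, and the function names are assumptions made for the illustration, and the standard error μ̂(1 − μ̂)/(2√n) is the one implied by the observed information at μ̂.

```python
import math
import random

def urr_rvs(n, mu, rng, tau=0.5):
    """URR variates by inversion: Q(u) = m*c / (1 + m*c), m = mu / (1 - mu)."""
    m = mu / (1.0 - mu)
    out = []
    for _ in range(n):
        u = rng.random()
        x = m * math.sqrt(math.log(1.0 - u) / math.log(1.0 - tau))
        out.append(x / (1.0 + x))
    return out

def urr_monte_carlo(mu, n, reps=2000, tau=0.5, seed=123):
    """Monte Carlo summary of the closed-form URR MLE."""
    rng = random.Random(seed)
    c = -math.log(1.0 - tau)
    estimates, covered = [], 0
    for _ in range(reps):
        ys = urr_rvs(n, mu, rng, tau)
        s = sum((y / (1.0 - y)) ** 2 for y in ys)
        m_hat = math.sqrt(c * s / n)
        mu_hat = m_hat / (1.0 + m_hat)
        se = mu_hat * (1.0 - mu_hat) / (2.0 * math.sqrt(n))
        covered += abs(mu_hat - mu) <= 1.959964 * se   # 95% Wald interval
        estimates.append(mu_hat)
    mean = sum(estimates) / reps
    rmse = math.sqrt(sum((e - mu) ** 2 for e in estimates) / reps)
    return {
        "mean": mean,
        "RB%": 100.0 * (mean - mu) / mu,
        "RMSE": rmse,
        "CP95": covered / reps,
    }
```

Calling `urr_monte_carlo(0.4, 50)` gives one cell of a table of this kind; looping over values of μ and n produces the full grid.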

Numerical analysis for the URG distribution
Analogously to the previous experiment, we first generate occurrences of the variable Y, which follows a URG distribution with different configurations of its parameters μ and β. The data are generated using the inversion method in the URG qf. Table 3 presents the simulation results. It shows that the estimates of μ are more accurate than those of β. We can also observe that the RB of μ̂ is always less than 0.4% in absolute value. For sample sizes greater than 75, the RB of β̂ is always less than 10%. The last column of Table 3 shows that the coverage probabilities of the 95% confidence intervals for both parameters are quite close to the nominal level.

Applications
This section illustrates the usefulness of the UREW family through applications to educational data related to student dropout, also known as student attrition. This outcome has some complexity in data collection [42], and a diversity of definitions has been considered in the specialized literature. In this paper, we are interested in analyzing the first-year dropout rate in undergraduate courses, defined as the proportion of students who withdraw from the course before completing the first year. Thus, from a sample with n undergraduate courses, the ith observation is obtained as

$$y_i = \frac{\text{number of freshmen students who dropped out of the } i\text{th course before completing the first year}}{\text{number of freshmen students enrolled in the } i\text{th course}},$$

where i ∈ {1, ..., n}. The decision to focus the study on freshmen students lies in the evidence that the risk of dropping out is higher during the first year of college, also called the freshman year [42,43]. This period is seen as the most critical time for the connection between academic programs and students [44]. Therefore, understanding the behavior of this variable may be helpful in developing practices aimed at reducing early dropout from undergraduate courses in different areas.
The data used in this case study were collected from the Brazilian higher education census microdata, conducted in 2018 [45], and were calculated from the students entering in 2018. We select the in-person courses with more than 29 new students and a first-year dropout rate in the (0, 1) interval in the census academic year. The applications refer to four data sets on civil engineering, economics, computer sciences, and control engineering courses. We fit UREW special models and compare their performance with other existing double-bounded distributions, which are not special cases of the proposed family.
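The selection rule can be illustrated with a small Python sketch. The input layout (pairs of entrants and first-year dropouts per course) and the toy numbers are hypothetical and are not taken from the census microdata.

```python
def first_year_dropout_rates(courses, min_entrants=30):
    """Keep courses with at least `min_entrants` new students (more than 29)
    and a first-year dropout rate strictly inside (0, 1)."""
    rates = []
    for entrants, dropouts in courses:
        if entrants < min_entrants:
            continue                      # too few new students
        rate = dropouts / entrants
        if 0.0 < rate < 1.0:              # drop boundary values 0 and 1
            rates.append(rate)
    return rates

# toy (entrants, dropouts) pairs, for illustration only
courses = [(40, 8), (25, 5), (60, 0), (35, 35), (100, 17)]
rates = first_year_dropout_rates(courses)
```

Only the first and last toy courses survive the filter: the second has too few entrants, and the third and fourth have rates of exactly 0 and 1.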
Table 4 gives a descriptive summary of the dropout rates of each data set considered. The economics course exhibits smaller values for all central tendency measures and higher values for the skewness, kurtosis, and range measures. The other courses present measures that are quite close to each other. Their mean and median are around 17% and 16%, respectively. The descriptive measures indicate that, for all data sets, the mass of observations concentrates on the left. This pattern is expected since the dropout rate is negatively related to institutional quality and effectiveness. Academic programs with low dropout rates are often considered to be more efficient [34]. Nevertheless, the dropout rates in higher education are social and institutional concerns [42], and there is a broad consensus on the need for universities to promote students' success [46]. The fact that many students do not achieve their goals during their university experience is a waste of talent and human potential [42,46]. For modeling these data, we fit the five UREW special models studied in the current paper, i.e., the URG, URBX II, URL, URR, and URW distributions. Their densities are given by equations (5), (7), (9), (11), and (10), respectively. We fix τ at 0.5 in those equations. For comparison purposes, we also consider six well-known alternative distributions for random variables supported on the unit interval. We fit the beta, KW, UBS, UG, UW, and complementary unit Weibull (CUW) [14] distributions. They are not UREW special cases and are selected as competitors due to their relevance in the literature. The beta and KW are classical models for double-bounded outcomes. The UG is chosen due to its relevance to various problems. It has received a great deal of attention from statisticians for developing methodological advances [47]. The UBS and UW are two of the most relevant models among recent advances in distribution theory. The CUW arises as an alternative model due to its usefulness in educational modeling. This distribution has proved helpful in analyzing literacy rates [14]. The densities of all these competitor models are presented in Appendix A.
Parameter estimation is performed by the maximum likelihood method for all fitted models, and the corrected Cramér-von Mises statistic [48] (W*) is considered as the goodness-of-fit measure. Those estimates are computed using the goodness.fit function from the AdequacyModel package [49]. The goodness.fit function computes the MLEs of probability distributions and their goodness-of-fit measures. It uses the optim function in its implementation and includes several optimization techniques. For the results in this paper, we use the BFGS algorithm and compute the observed information matrix numerically. Thus, the standard errors and confidence intervals are obtained from the asymptotic normality property of the MLEs. We set the initial values at 1 for the shape (or precision) parameter, at the sample mean for the distributions indexed by the mean, and at the sample quantile for those with a quantile parametrization.
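For readers unfamiliar with W*, the sketch below shows one common way to compute a corrected Cramér-von Mises statistic, following the procedure usually attributed to Chen and Balakrishnan. It is an assumption that this matches exactly what goodness.fit reports; the Python function name is hypothetical, and `cdf` is any fitted cdf evaluated at the estimated parameters.

```python
from statistics import NormalDist, mean, stdev

def cramer_von_mises_corrected(sample, cdf):
    """Corrected Cramer-von Mises statistic W* for a fitted cdf (sketch)."""
    n = len(sample)
    std_normal = NormalDist()
    eps = 1e-12
    # probability integral transform of the ordered sample, kept inside (0, 1)
    v = [min(max(cdf(y), eps), 1.0 - eps) for y in sorted(sample)]
    # map to the normal scale and standardize with the sample mean and sd
    z = [std_normal.inv_cdf(p) for p in v]
    z_bar, s_z = mean(z), stdev(z)
    u = [std_normal.cdf((zi - z_bar) / s_z) for zi in z]
    w2 = sum((ui - (2 * i - 1) / (2 * n)) ** 2
             for i, ui in enumerate(u, start=1)) + 1.0 / (12 * n)
    return w2 * (1.0 + 0.5 / n)           # small-sample correction
```

Smaller values of W* indicate a better fit, which is how the models are ranked in the applications.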
The estimation results for all data sets are reported in Table 5. We observe that the distributions in the UREW family have the lowest W* for the course types considered. The proposed models occupy the first three positions in the ranking for civil engineering and computer sciences. For the control engineering course, the URW outperforms the others and is followed by the URL distribution, which also belongs to the UREW family. For the economics course, the URL distribution has the best goodness of fit. Fig 4 displays the boxplots and the histograms with fitted density functions for the three best models according to W*. Those plots corroborate that the UREW fits are adequate for the dropout rates of all course types considered and provide a real improvement over existing distributions. Therefore, the proposed family is shown to be competitive with classical unit models such as the beta and KW distributions.
The UREW special cases also exhibit superior performance when compared to recent alternatives, including the UBS, UW, and CUW distributions. It is worth noting that the CUW distribution has been commonly used in educational modeling. In [14], it was verified that this model can properly fit literacy rates. However, it is important to highlight that while higher literacy rates are desirable [14], lower values are considered more favorable in the case of dropout rates [17]. Accordingly, left-skewed distributions are expected to fit the former better, and right-skewed distributions to be more suitable for the latter. This feature may explain why the CUW is not among the best models for the analyzed data sets, while evincing the capacity of the UREW family to model the first-year dropout rate effectively.
Our results may represent useful tools for universities to evaluate and improve their programs. This is a relevant application, as it allows us to deal with the academic, social, and economic implications of university dropout [17]. Nevertheless, other potential applications can be explored in the context of educational modeling. The new family can be competitive for modeling literacy rates [14], educational attainment percentages [31], proportions of adolescents who want top grades at school [32], and proportions of novice teachers with a mentor at the school [33]. These variables have been explored through other distributions commonly used in educational modeling. We can also cite graduation and persistence rates as further applications, which are related to student progression and academic success patterns [34].

Final remarks
This paper defines the unit ratio-extended Weibull (UREW) family of distributions. It is obtained from a ratio transformation in the extended Weibull family and can be used to model continuous random variables on the unit interval. The new family has a closed form for quantile measures; thus, we provide a quantile parametrization for the family. Several special cases are derived, and parameter estimation is explored using maximum likelihood theory. We show that some one-parameter UREW special cases may present a closed form for the maximum likelihood estimator (MLE). We perform Monte Carlo experiments to assess the performance of those estimators. For example, the unit ratio-Rayleigh MLE is approximately unbiased for small sample sizes. We also note an appropriate performance for the unit ratio-Gompertz MLEs. The utility of the proposed family is illustrated with applications to the first-year dropout rate of undergraduate courses in Brazilian universities. We select four course types and note that, for those data, the UREW special models fit properly and outperform other classical and recent unit distributions. Thus, the new family can be a competitive alternative when those models are unsuitable. We emphasize that a long list of possibilities can be addressed in future works. For example, our approach can be investigated in the presence of zeros and ones, and quantile regression models are also a natural path. The UREW can also be generalized to accommodate time-dependent double-bounded indicators by using autoregressive moving average models. This kind of structure is in the state-of-the-art literature on the analysis of double-bounded time series. The UREW can also attract applications to other double-bounded variables, being a competitive option to other unit distributions commonly used in educational modeling. For instance, literacy rates, educational attainment percentages, and graduation and persistence rates are educational measurements that represent potential applications for the proposed family.
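
Appendix A
• The UG density is given by

$$f(y) = \frac{1}{\Gamma(\phi)} \left[\frac{\mu^{1/\phi}}{1-\mu^{1/\phi}}\right]^{\phi} y^{\frac{\mu^{1/\phi}}{1-\mu^{1/\phi}} - 1} (-\log y)^{\phi-1}, \quad 0 < y < 1,$$

where μ ∈ (0, 1) is the mean of Y and ϕ > 0 is a precision parameter. The above parametrization was introduced in [52].
• The UBS density is given by

$$f(y) = \frac{1}{2\alpha\beta\sqrt{2\pi}\, y} \left[\left(\frac{\beta}{-\log y}\right)^{1/2} + \left(\frac{\beta}{-\log y}\right)^{3/2}\right] \exp\!\left\{\frac{1}{2\alpha^{2}} \left(\frac{\log y}{\beta} + \frac{\beta}{\log y} + 2\right)\right\}, \quad 0 < y < 1,$$

where α > 0 and β > 0 are shape parameters. The UBS was introduced in [11].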
• The UW density is given by

$$f(y) = \frac{\beta}{y}\, \frac{\log \tau}{\log \mu} \left(\frac{\log y}{\log \mu}\right)^{\beta-1} \tau^{(\log y/\log \mu)^{\beta}}, \quad 0 < y < 1,$$

where μ ∈ (0, 1) is the τth quantile parameter and β > 0 is a shape parameter. The above parametrization was introduced in [41]. In Section 6, we fix τ at 0.5; thus, the parameter μ refers to the median of Y.
• The CUW density is given by

$$f(y) = \frac{\beta \log(0.5)}{(1-y) \log(1-\mu)} \left[\frac{\log(1-y)}{\log(1-\mu)}\right]^{\beta-1} 0.5^{\left[\log(1-y)/\log(1-\mu)\right]^{\beta}}, \quad 0 < y < 1,$$

where μ ∈ (0, 1) is the median of Y and β > 0 is a shape parameter. The above distribution was introduced in [14].

Fig 1(a) illustrates the URG pdf shapes for several combinations of μ and β, with τ = 0.5.

Fig 1(d) illustrates the URR pdf shapes for several values of μ, with τ = 0.5. It shows that the URR distribution presents unimodal density shapes, accommodating left- and right-skewed data in the unit interval.

Fig 2 indicates that the RB and RMSE of μ̂ decrease as the sample size increases, corroborating the asymptotic properties of the MLEs.

Fig 3(a) presents a plot with the sum of the RBs of μ̂ and β̂, which we call the total RB. Fig 3(b) presents a similar plot with the sum of the RMSEs of μ̂ and β̂, which we call the total RMSE. They show that the total RB and total RMSE of μ̂ and β̂ decrease as the sample size increases, corroborating the asymptotic properties of the MLEs.